One of the fundamental ideas behind regression and classification models is that, for the most part, observations with similar inputs will be assigned similar outputs. The k-Nearest Neighbours algorithm takes this idea to the extreme: in the regression case, the prediction for a new observation is simply the average of the targets of the k training observations closest to it.
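Concretely, if $N_k(x)$ denotes the set of the $k$ training points nearest to a new point $x$, the k-NN regression prediction can be written as

$$\hat{y}(x) = \frac{1}{k}\sum_{i \in N_k(x)} y_i,$$

i.e. an unweighted average of the neighbours' targets (distance-weighted variants exist, but the implementation below uses the plain average).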
#Import modules
import numpy as np
import random
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
n = 200 #Number of observations
X = np.random.uniform(0,5,n)
Y = np.array([x*np.sin(x**2) for x in X]) + np.random.normal(0,1,n)
data = pd.DataFrame({'X':X, 'Y':Y})
plt.figure(figsize = (10,6))
plt.scatter(X,Y)
plt.show()
class kNN:

    def __init__(self, data, target, features, trainTestRatio = 0.9):
        self.target = target
        self.features = features

        #Split up data into a training and testing set
        self.train, self.test = train_test_split(data, test_size=1-trainTestRatio)

    #There's no fitting process for k-NN, per se. To make predictions we simply examine the training set
    #and find the closest examples we can

    def predictSingleExample(self, x, k = 5):
        #For a given example x, we find the k examples in the training set whose features are closest
        #in terms of Euclidean distance

        trainingData = self.train.copy()

        #Compute distance from x to each element of the training set
        trainingData['distance'] = [np.linalg.norm(np.array(x[self.features]) - np.array(row[self.features])) for idx, row in trainingData.iterrows()]

        #Sort by distance and then take the first k
        closestk = trainingData.sort_values('distance', ascending = True).head(k)

        #Now return the mean of those closest k
        return closestk[self.target].mean()

    def predict(self, X = None, k = 5):
        #Predict the values for a dataframe of values (in the same format as the training set)
        #Returns a list of the predicted values

        #If no X provided, predict values of the test set
        if X is None:
            X = self.test

        return [self.predictSingleExample(x, k) for idx, x in X.iterrows()]
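As an aside, the row-by-row distance loop in predictSingleExample is the simplest thing that works, but iterating over a DataFrame is slow. A vectorised alternative (a sketch of my own, not part of the class above; euclideanDistances is a hypothetical helper name) would compute all distances in a single NumPy call:

#Sketch: vectorised alternative to the row-by-row distance loop (hypothetical helper, not used above)
def euclideanDistances(trainFrame, features, x):
    #Broadcast x against every training row and take the norm along the feature axis
    diffs = trainFrame[features].to_numpy(dtype=float) - np.asarray(x[features], dtype=float)
    return np.linalg.norm(diffs, axis=1)

Inside predictSingleExample, the list comprehension could then be replaced with trainingData['distance'] = euclideanDistances(trainingData, self.features, x).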
mykNN = kNN(data, 'Y', ['X'])
mykNN.test['Pred'] = mykNN.predict(k = 5)
Let's plot the results.
f = plt.figure(figsize=(15,6))
ax = f.add_subplot(121)
ax2 = f.add_subplot(122)
ax.scatter(mykNN.test['X'], mykNN.test['Y'] - mykNN.test['Pred'])
ax.set_xlabel('X')
ax.set_ylabel('Y - Pred')
ax.set_xlim([0, 5])
ax.set_ylim([-5,5])
ax2.scatter(mykNN.test['Y'], mykNN.test['Pred'], label = 'True values vs Predicted Values')
ax2.plot(np.arange(-5,5,0.1), np.arange(-5,5,0.1), color = 'green', label = 'Line y = x')
ax2.set_xlim([-5, 5])
ax2.set_ylim([-5,5])
ax2.set_xlabel('True Label')
ax2.set_ylabel('Predicted Label')
ax2.legend()
plt.show()
The left plot shows the residuals plotted against our input feature, X. We can see that overall, the residuals are scattered about 0, indicating that in general the model did a decent job of capturing the relationship between the feature and target. The residuals exhibit slightly more variance when $x > 3$, but this is not particularly surprising as the target oscillates for large $x$.
If we examine the plot on the right, we can see that the scatter plot of the true vs predicted values generally adheres to the line $y = x$, indicating the model is performing as we would want.
For a conceptually simple algorithm, k-nearest neighbours has worked surprisingly well on this example - far better, for example, than we would expect linear regression to, given the highly non-linear feature-target relationship.
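To attach a rough number to this impression (an illustrative addition, not part of the original analysis), we could compute the root-mean-squared error of the test-set predictions made above:

#Sketch: quantify the test-set fit with RMSE (illustrative addition)
rmse = np.sqrt(np.mean((mykNN.test['Y'] - mykNN.test['Pred'])**2))
print(f'Test RMSE: {rmse:.3f}')

Since the noise added to the data has standard deviation 1, an RMSE not far above 1 would be consistent with the residual plot.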
Whilst k-nearest neighbours has performed well on a small dataset such as this one, it tends to scale poorly to larger datasets: larger datasets demand more memory to store, and finding the nearest neighbours of a new example takes longer. Additionally, k-nearest neighbours suffers from the 'Curse of Dimensionality', meaning that its performance tends to degrade substantially as the number of features increases.
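One common way to mitigate the lookup cost (not used in the implementation above) is a spatial index such as a k-d tree, which avoids computing the distance to every training point for every query. A rough sketch using scikit-learn's KDTree on our one-dimensional feature:

#Sketch: neighbour search via a k-d tree instead of brute force (not part of the kNN class above)
from sklearn.neighbors import KDTree

trainX = mykNN.train[['X']].to_numpy()
trainY = mykNN.train['Y'].to_numpy()
tree = KDTree(trainX)                                        #Build the index once
dist, idx = tree.query(mykNN.test[['X']].to_numpy(), k = 5)  #Find the 5 nearest neighbours of each test point
treePreds = trainY[idx].mean(axis = 1)                       #Average their targets, as before

Note that tree-based indices help most in low dimensions; they do not remove the curse of dimensionality itself.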
k is a hyperparameter we can tune, and it should be thought of as a regulariser. The larger the value of k, the more training examples contribute to each prediction, and therefore the more robust the model is to slight perturbations in the training data.
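One rough way to see this: if the targets of the $k$ neighbours of a point $x$ had roughly the same mean and independent noise of variance $\sigma^2$, then the prediction, being an average of $k$ of them, would have variance of about

$$\mathrm{Var}\big(\hat{y}(x)\big) \approx \frac{\sigma^2}{k},$$

so larger $k$ averages away more noise, at the price of averaging over a wider neighbourhood and potentially smoothing out genuine structure (higher bias).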
Let's plot the regression line for a variety of values of k.
plt.figure(figsize = (10,6))
dataRegLine = pd.DataFrame({'X':np.linspace(0,5,100)})
for k in [1, 5, 10, 30, 100, 180]:
    dataRegLine[f'Pred{k}'] = mykNN.predict(X = dataRegLine[['X']], k = k) #Obtain predictions
    plt.plot(dataRegLine['X'], dataRegLine[f'Pred{k}'], label = f'{k}-Nearest Neighbours') #Plot regression line
plt.legend()
plt.scatter(mykNN.train['X'], mykNN.train['Y'])
plt.show()
The plot above shows the effect of changing the value of k. Setting k = 1 yields the most non-linear regression line, but has the disadvantage that it essentially just connects the training points. Conversely, setting k = 180 (the size of the training set, since 10% of the 200 observations are held out for testing) returns the same value for all inputs, as we take the average of the whole training set each time we make a prediction. Intermediate values of k exhibit varying degrees of non-linearity, in line with what we might expect.
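If we wanted to choose k rather than judge it by eye, one simple approach (an illustrative sketch, not part of the original; strictly we should tune on a separate validation set or via cross-validation rather than the test set) is to compare the error over a grid of candidate values:

#Sketch: compare test-set RMSE over a grid of k values (illustrative only; a proper tuning
#procedure would use a validation set or cross-validation rather than the test set)
candidateKs = [1, 3, 5, 10, 20, 30, 50]
rmseByK = {}
for k in candidateKs:
    preds = np.array(mykNN.predict(k = k))   #Predictions for the held-out test set
    rmseByK[k] = np.sqrt(np.mean((mykNN.test['Y'].to_numpy() - preds)**2))
bestK = min(rmseByK, key = rmseByK.get)
print(f'k with lowest test RMSE: {bestK}')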